Abstract

Since the early Sixties and Seventies it has been 
known that the regular and context-free lan- 
guages are characterized by definability in the 
monadic second-order theory of certain struc- 
tures. More recently, these descriptive charac- 
terizations have been used to obtain complex- 
ity results for constraint- and principle-based 
theories of syntax and to provide a uniform 
model-theoretic framework for exploring the re- 
lationship between theories expressed in dis- 
parate formal terms. These results have been 
limited, to an extent, by the lack of descrip- 
tive characterizations of language classes be- 
yond the context-free. Recently, we have shown 
that tree-adjoining languages (in a mildly gener- 
alized form) can be characterized by recognition 
by automata operating on three-dimensional 
tree manifolds, a three-dimensional analog of 
trees. In this paper, we exploit these automata- 
theoretic results to obtain a characterization 
of the tree-adjoining languages by definability 
in the monadic second-order theory of these 
three-dimensional tree manifolds. This not only 
opens the way to extending the tools of model- 
theoretic syntax to the level of TALs, but pro- 
vides a highly flexible mechanism for defining 
TAGs in terms of logical constraints.



1 Introduction 



In the early Sixties Büchi (1960) and Elgot (1961)
established that a set of strings was
regular iff it was definable in the weak monadic
second-order theory of the natural numbers
with successor (wS1S). In the early Seventies
an extension to the context-free languages was 



* This is the full version of a paper to appear in the pro- 
ceedings of COLING-ACL'98 as a project note (Rogers, 
1998). 



obtained by Thatcher and Wright (1968) and
Doner (1970), who established that the CFLs
were all and only the sets of strings forming the 
yield of sets of finite trees definable in the weak 
monadic second-order theory of multiple succes- 
sors (wSnS). These descriptive characterizations 
have natural application to constraint- and 
principle-based theories of syntax. We have em- 
ployed them in exploring the language-theoretic 



complexity of theories in GB (Rogers, 1994;
Rogers, 1997b) and GPSG (Rogers, 1997a) and
have used these model-theoretic interpretations
as a uniform framework in which to compare
these formalisms (Rogers, 1996). They have
also provided a foundation for an approach 
to principle-based parsing via compilation into 
tree-automata (Morawietz and Cornell, 1997). 
Outside the realm of Computational Linguis- 
tics, these results have been employed in the- 
orem proving with applications to program and 



hardware verification (Henriksen et al., 1995;
Biehl et al., 1996; Kelb et al., 1997). The
scope of each of these applications is limited,
to some extent, by the fact that there are no
such descriptive characterizations of classes of 
languages beyond the context-free. As a result, 
there has been considerable interest in extending
the basic results (Mönnich, 1997; Volger,
1997) but, prior to the work reported here, the



proposed extensions have not preserved the sim- 
plicity of the original results. 

Recently, in (Rogers, 1997c), we introduced
a class of labeled three-dimensional tree-like 
structures (three-dimensional tree manifolds — 
3-TM) which serve simultaneously as the 
derived and derivation structures of
Tree-Adjoining Grammars (TAGs) in exactly the
same way that labeled trees can serve as both 
derived and derivation structures for CFGs. We 
defined a class of automata over these struc- 



tures that are a natural generalization of tree- 
automata (which are, in turn, an analogous 
generalization of ordinary finite-state automata 
over strings) and showed that the class of tree 
manifolds recognized by these automata is
exactly the class of tree manifolds generated by
TAGs if one relaxes the usual requirement that 
the labels of the root and foot of an auxiliary 
tree and the label of the node at which it adjoins 
all be identical. 

Thus there are analogous classes of automata 
at the level of labeled three-dimensional tree 
manifolds, the level of labeled trees and at the 
level of strings (which can be understood as 
two- and one-dimensional tree manifolds) which 
recognize sets of structures that yield, respec- 
tively, the TALs, the CFLs, and the regular 
languages. Furthermore, the nature of the gen- 
eralization between each level and the next is 
simple enough that many results lift directly 
from one level to the next. In particular, we 
get that the recognizable sets at each level are 
closed under union, intersection, relative com- 
plement, projection, cylindrification, and de- 
terminization and that emptiness of the rec- 
ognizable sets is decidable. These are exactly 
the properties one needs to establish that rec- 
ognizability by the automata over a class of 
structures characterizes satisfiability of monadic 
second-order formulae in the language appropri- 
ate for that class. Thus, just as the proofs of clo- 
sure properties lift directly from one level to the 
next, Doner's and Thatcher and Wright's proofs 
that the recognizable sets of trees are char- 
acterized by definability in wSnS lift directly 
to a proof that the recognizable sets of three- 
dimensional tree manifolds are characterized by 
definability in their weak monadic second-order 
theory (which we will refer to as wSnT3). 

In this paper we carry out this program. In 
the next three sections we introduce 3-TMs 
and our uniform notion of automaton over tree 
manifolds of arbitrary (finite) dimension and 
sketch, as an example, proofs of closure under
determinization, projection and cylindrification
that are independent of the dimensionality.
In Sections 5 and 6 we introduce wSnT3,
the weak monadic second-order theory of
n-branching 3-TM, and sketch the proof that
the sets recognized by 3-TM automata are ex- 
actly the sets definable in wSnT3. This, when 



coupled with the characterization of TALs in 
Rogers (1997c), gives us our descriptive
characterization of TALs: a set of strings is
generated by a TAG (modulo the generalization of
Rogers (1997c)) iff it is the (string) yield of a
set of 3-TM definable in wSnT3. Finally, in
Section 7 we look at how working in wSnT3
allows a potentially more natural means of
defining TALs and, in particular, a simplified
treatment of constraints on modifiers in TAGs.

2 Tree Manifolds 

Tree manifolds are a generalization to arbitrary
dimensions of Gorn's tree domains (Gorn,
1967). A tree domain is a set of node addresses
drawn from $\mathbb{N}^*$ (that is, a set of strings of
natural numbers) in which $\varepsilon$ is the address of the
root and the children of a node at address $w$
occur at addresses $w0, w1, \ldots$, in left-to-right
order. To be well formed, a tree domain must
be downward closed wrt domination, which
corresponds to being prefix closed, and left sibling
closed in the sense that if $wi$ occurs then
so does $wj$ for all $j < i$. In generalizing these,
we can define a one-dimensional analog as string 
domains: downward closed sets of natural num- 
bers interpreted as string addresses. From this 
point of view, the address of a node in a tree 
domain can be understood as the sequence of 
string addresses one follows in tracing the path 
from the root to that node. If we represent $\mathbb{N}$
in unary (with $n$ represented as $1^n$) then the
downward closure property of string domains
becomes a form of prefix closure analogous to
downward closure wrt domination in tree domains,
tree domains become sequences of sequences
of '1's, and the left-closure property of
tree domains becomes a prefix closure property
for the embedded sequences.
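
The two well-formedness conditions on tree domains can be checked mechanically. The following is a minimal sketch, assuming addresses are represented as Python tuples of natural numbers (with the empty tuple standing for the root); the names `is_tree_domain` and `to_unary` are illustrative and do not come from the paper.

```python
def is_tree_domain(addresses):
    """Return True iff `addresses` (tuples of naturals) form a Gorn tree domain."""
    nodes = set(addresses)
    for w in nodes:
        if not w:
            continue  # the root (empty address) has no parent and no siblings
        parent, i = w[:-1], w[-1]
        # Downward closure wrt domination: the parent address must be present,
        # which (applied to every node) amounts to prefix closure.
        if parent not in nodes:
            return False
        # Left-sibling closure: w.i requires w.(i-1); since that sibling is
        # itself checked by this loop, every w.j with j < i is present too.
        if i > 0 and parent + (i - 1,) not in nodes:
            return False
    return True


def to_unary(address):
    """Encode each string address n as the unary string 1^n."""
    return tuple("1" * n for n in address)


# A root with two children, the first of which has one child.
example = {(), (0,), (1,), (0, 0)}
```

Under the unary encoding both conditions become prefix closure: `(1,)` encodes as `("1",)`, whose embedded sequence has the prefix `""`, the encoding of its left sibling `(0,)`.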

Raising this to higher dimensions, we obtain, 
next, a class of structures in which each node 
expands into a (possibly empty) tree. A
three-dimensional tree manifold (3-TM), then, is a set
of sequences of tree addresses (that is, addresses 
of nodes in tree domains) tracing the paths from 
the root of one of these structures to each of 
the nodes in it. Again this must be down- 
ward closed wrt domination in the third dimen- 
sion, equivalently wrt prefix, the sets of tree ad- 
dresses labeling the children of any node must 
be downward closed wrt domination in the second
dimension (again wrt prefix), and the
sets of string addresses labeling the children of
any node in any of these trees must be downward
closed wrt domination in the first dimension
(left-of, and, yet again, prefix). [1] Thus
3-TM, tree domains (2-TM), and string domains
(1-TM) can be defined uniformly as $d$-th-order
sequences of '1's which are hereditarily prefix
closed. We will denote the set of all d-TM as $\mathbb{T}^d$,
so $\mathbb{T}^1 \subseteq \mathcal{P}(1^*)$ is the set of all string domains,
$\mathbb{T}^2 \subseteq \mathcal{P}((1^*)^*)$ the set of all tree domains, and
$\mathbb{T}^3 \subseteq \mathcal{P}(((1^*)^*)^*)$ the set of all 3-TM, where
$\mathcal{P}(S)$ is the power set of $S$. [2]

For any alphabet $\Sigma$, a $\Sigma$-labeled
d-dimensional tree manifold is a pair $\langle T, \tau \rangle$
where $T$ is a d-TM and $\tau : T \to \Sigma$ is an
assignment of labels in $\Sigma$ to the nodes in $T$.
We will denote the set of all $\Sigma$-labeled d-TM as
$\mathbb{T}^d_\Sigma$.

3 Tree Manifold Automata 

Mimicking the development of tree manifolds, 
we can define automata over labeled 3-TM as a 
generalization of automata over labeled tree do- 
mains which, in turn, can be understood as an 
analogous generalization of ordinary finite-state 
automata over strings (labeled string domains). 
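A d-TM automaton with state set $Q$ and alphabet
$\Sigma$ is a finite set:

$$\mathcal{A}^d \subseteq \Sigma \times Q \times \mathbb{T}^{d-1}_Q.$$

The interpretation of a tuple $\langle \sigma, q, \mathcal{T} \rangle \in \mathcal{A}^d$ is
that if a node of a d-TM is labeled $\sigma$ and $\mathcal{T}$
encodes the assignment of states to its children,
then that node may be assigned state $q$. [3] A run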

1 While this clearly iterates to obtain tree manifolds 
of any finite dimension, we are concerned only with 
the first three dimensions (four, counting points, that is,
zero-dimensional tree manifolds).

2 In (Rogers, 1997c) we constructed tree manifolds
from headed strings in order to obtain a unique tree as
the two-dimensional yield of a 3-TM. Here we treat this
as a matter of interpretation, keeping the simple notion
of tree manifold and moving the issue of headedness into
the relational structures we build on them.

3 This is a "bottom-up" interpretation. There is an 
analogous "top-down" interpretation, but for all $d \geq 2$
automata that are deterministic under the top-down in- 
terpretation are strictly weaker than those that are non- 
deterministic, while those that are deterministic under 
the bottom-up interpretation are equivalent to the non- 
deterministic variety. It should be emphasized that the 
only place the distinction between top-down and bottom- 



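of a d-TM automaton $\mathcal{A}$ on a $\Sigma$-labeled d-TM
$\mathcal{T} = \langle T, \tau \rangle$ is an assignment $r : T \to Q$ of states
in $Q$ to nodes in $T$ in which each assignment
is licensed by $\mathcal{A}$. Note that this implies that
a maximal node (wrt the major dimension)
labeled $\sigma$ may be assigned state $q$ only if there is
a tuple $\langle \sigma, q, \varepsilon \rangle \in \mathcal{A}^d$ where $\varepsilon$ is the empty
$(d-1)$-TM. If we let $Q_0 \subseteq Q$ be any set of accepting
states, then the set of (finite) $\Sigma$-labeled d-TM
recognized by $\mathcal{A}$, relative to $Q_0$, is that set for
which there is a run of $\mathcal{A}$ that assigns the root
a state in $Q_0$:

$$\mathcal{A}(Q_0) \stackrel{\text{def}}{=} \{\, \mathcal{T} = \langle T, \tau \rangle \mid T \text{ finite and } \exists r : T \to Q \text{ such that } r(\varepsilon) \in Q_0 \text{ and, for all } s \in T,\ \langle \tau(s), r(s), \langle T, r \rangle|_{Ch(T,s)} \rangle \in \mathcal{A} \,\}$$

where $Ch(T, s) = \{\, w \mid s \cdot \langle w \rangle \in T \,\}$ and

$$\langle T, r \rangle|_{Ch(T,s)} = \langle Ch(T,s),\ \{\, w \mapsto r(s \cdot \langle w \rangle) \mid w \in Ch(T,s) \,\} \rangle.$$

A set of d-TM is recognizable iff it is $\mathcal{A}(Q_0)$ for
some d-TM automaton $\mathcal{A}$ and set of accepting
states $Q_0$.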
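
For the familiar case $d = 2$, the definition of recognition can be sketched in a few lines. This is only an illustration under stated assumptions: a labeled tree is a pair `(label, [subtrees])`, the third component of each tuple (a $Q$-labeled 1-TM) is written as a tuple of child states, and the names `reachable_states`, `accepts` and the Boolean-circuit automaton `delta` are hypothetical.

```python
def reachable_states(tree, delta):
    """States that some run licensed by `delta` can assign to the root of `tree`.

    tree  : (label, [subtrees])
    delta : set of (label, state, child_states); the empty tuple plays the
            role of the empty (d-1)-TM at a maximal node (a leaf).
    """
    label, children = tree
    child_sets = [reachable_states(c, delta) for c in children]
    result = set()
    for sigma, q, kids in delta:
        if sigma != label or len(kids) != len(children):
            continue
        # The tuple licenses q if each child can be assigned the state it names.
        if all(k in s for k, s in zip(kids, child_sets)):
            result.add(q)
    return result


def accepts(tree, delta, accepting):
    """True iff some run assigns the root a state in `accepting`."""
    return bool(reachable_states(tree, delta) & set(accepting))


# Hypothetical automaton evaluating and/or trees over leaves "0" and "1".
delta = {("1", True, ()), ("0", False, ())}
for x in (False, True):
    for y in (False, True):
        delta.add(("and", x and y, (x, y)))
        delta.add(("or", x or y, (x, y)))

leaf0, leaf1 = ("0", []), ("1", [])
good = ("and", [leaf1, ("or", [leaf0, leaf1])])
bad = ("and", [leaf0, leaf1])
```

The same recursion would apply one dimension up, with the tuple of child states replaced by a state-labeled tree; the sketch stays at $d = 2$ to keep the data structures small.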

4 Uniform Properties of 
Recognizable Sets 

The strength of the uniform definition of d- 
TM automata is that many, even most, prop- 
erties of the sets they recognize can be proved 
uniformly — independently of their dimension. 
For instance, let us say that the depth of a TM 
is the length of the longest sequence it includes 
(just the length of the top level sequence, inde- 
pendent of the length of the sequences it may 
contain). The branching factor of a TM at a 
given dimension is the maximum depth of the 



up arises is in the definition of determinism. These au- 
tomata are interpreted purely declaratively, as licensing 
assignments of states to nodes. 

4 In general, we will employ w and s in this manner 
where w denotes a sequence of some order and s denotes 
a sequence of sequences of the order of w (i.e., a sequence 
of the next higher order). Concatenation will always be 
interpreted as an operation on sequences of the same 
order. Thus, $s \cdot \langle w \rangle$ is a sequence of sequences in which
the last sequence is w. We will also use t and v as we 
use s and w, and will employ p for sequences of the next 
higher order than s and t when needed. 



structures it contains in that dimension. The 
(overall) branching factor of a d-TM is the max- 
imum of its branching factors at all dimensions 
strictly less than d. For a 3-TM, then, the 
branching factor is the larger of the maximum 
depth of the trees it contains and the maximum 
length of the strings it contains. A TM is
n-branching iff its branching factor is no greater
than $n$. We will denote the set of all $\Sigma$-labeled,
n-branching, d-TM as $\mathbb{T}^{n,d}_\Sigma$. A d-TM automaton
is deterministic with respect to a branching
factor $n$ (in the bottom-up sense) iff

$$(\forall \sigma \in \Sigma,\ \mathcal{T} \in \mathbb{T}^{n,d-1}_Q)(\exists! q \in Q)[\langle \sigma, q, \mathcal{T} \rangle \in \mathcal{A}].$$ [5]

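It is easy to show, using a standard subset
construction, that (bottom-up) determinism
does not affect the recognizing power of d-TM
automata of any dimension. Given $\mathcal{A} \subseteq \Sigma \times Q \times \mathbb{T}^{n,d-1}_Q$, let

$$\hat{\mathcal{A}} \subseteq \Sigma \times \mathcal{P}(Q) \times \mathbb{T}^{n,d-1}_{\mathcal{P}(Q)}$$

$$\hat{\mathcal{A}} \stackrel{\text{def}}{=} \{\, \langle \sigma, Q_1, \langle T, \tau' \rangle \rangle \mid Q_1 \subseteq Q,\ \tau' : T \to \mathcal{P}(Q),\ q \in Q_1 \Leftrightarrow (\exists r : T \to Q)[\, \langle \sigma, q, \langle T, r \rangle \rangle \in \mathcal{A} \wedge (\forall x \in T)[r(x) \in \tau'(x)] \,] \,\},$$

$$\hat{Q}_0 \stackrel{\text{def}}{=} \{\, Q_i \subseteq Q \mid Q_i \cap Q_0 \neq \emptyset \,\}.$$

It is easy to verify that $\hat{\mathcal{A}}$ is deterministic and
that $\hat{\mathcal{A}}(\hat{Q}_0) = \mathcal{A}(Q_0)$. More importantly, while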
the dimension of the TM automaton parameter- 
izes the type of the objects manipulated by the 
proof, it has no effect on the way in which they 
are manipulated — the proof itself is essentially 
independent of the dimension. 
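
The subset construction can be sketched for $d = 2$ as follows. This is an illustrative sketch, not the paper's construction verbatim: it assumes tuples of the form `(label, state, child_states)`, builds only the subset-states that are actually reachable rather than all of $\mathcal{P}(Q)$, and the names `determinize`, `run_det` and the example automaton are hypothetical.

```python
from itertools import product


def determinize(delta, alphabet, max_branching):
    """Map (label, tuple of child subset-states) to the unique subset-state Q1.

    Q1 collects every q for which some choice of one state per child, drawn
    from that child's subset, is licensed by a tuple of `delta`.
    """
    table = {}
    known = set()  # subset-states reached so far
    changed = True
    while changed:
        changed = False
        for sigma in alphabet:
            for k in range(max_branching + 1):
                for kids in product(list(known), repeat=k):
                    if (sigma, kids) in table:
                        continue
                    table[(sigma, kids)] = frozenset(
                        q for (s, q, states) in delta
                        if s == sigma and len(states) == k
                        and all(x in ks for x, ks in zip(states, kids)))
                    known.add(table[(sigma, kids)])
                    changed = True
    return table


def run_det(tree, table):
    """The single subset-state the deterministic automaton assigns to the root."""
    label, children = tree
    return table[(label, tuple(run_det(c, table) for c in children))]


# Hypothetical nondeterministic automaton: a leaf "a" may be in state 0 or 1.
delta = {("a", 0, ()), ("a", 1, ()), ("f", 1, (0, 1)), ("f", 0, (1, 1))}
table = determinize(delta, ["a", "f"], max_branching=2)
leaf = ("a", [])
```

A tree is then accepted iff `run_det` returns a subset that intersects $Q_0$, which mirrors the definition of $\hat{Q}_0$. The empty subset acts as a dead state, so every label and every configuration of children has exactly one entry in the table.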

Proof of closure of recognizable sets under 
projection and cylindrification is even easier. A 
projection is any (usually many-to-one) surjec- 
tive map from one alphabet onto another. A 
cylindrification is an "inverse" projection. Let 
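$\pi : \Sigma \to \Sigma'$ be any projection, $\mathcal{T} = \langle T, \tau \rangle$ a
$\Sigma$-labeled d-TM and $\mathcal{A}$ an automaton over
$\Sigma$-labeled d-TM. Then $\pi(\mathcal{T}) \stackrel{\text{def}}{=} \langle T, \pi \circ \tau \rangle$ and

$$\pi(\mathcal{A}) \stackrel{\text{def}}{=} \{\, \langle \pi(\sigma), q, \mathcal{T} \rangle \mid \langle \sigma, q, \mathcal{T} \rangle \in \mathcal{A} \,\}.$$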
5 The quantifier $\exists!$ should be read "exists exactly one".



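It is easy to see that

$$\mathcal{T} \in \mathcal{A}(Q_0) \Rightarrow \pi(\mathcal{T}) \in \pi(\mathcal{A})(Q_0).$$

Similarly, if $\mathcal{A} \subseteq \Sigma' \times Q \times \mathbb{T}^{(d-1)}_Q$ let

$$\pi^{-1}(\mathcal{A}) \stackrel{\text{def}}{=} \{\, \langle \sigma, q, \mathcal{T} \rangle \mid \langle \pi(\sigma), q, \mathcal{T} \rangle \in \mathcal{A} \,\}.$$

Then $\pi(\mathcal{T}) \in \mathcal{A}(Q_0) \Leftrightarrow \mathcal{T} \in \pi^{-1}(\mathcal{A})(Q_0)$.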

Similar uniform proofs can be obtained for 
closure of recognizable sets under Boolean op- 
erations and for decidability of emptiness. 
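
Projection and cylindrification only relabel the first component of each tuple, which a short sketch for $d = 2$ makes concrete. The representation (trees as `(label, [subtrees])`, tuples as `(label, state, child_states)`, the projection as a Python dict) and all names below are assumptions made for illustration.

```python
def project_automaton(delta, pi):
    """pi(A): push the label of every tuple through the projection."""
    return {(pi[sigma], q, kids) for (sigma, q, kids) in delta}


def cylindrify_automaton(delta_prime, pi):
    """pi^{-1}(A'): each source label inherits the tuples of its image."""
    return {(sigma, q, kids)
            for sigma in pi
            for (sigma_prime, q, kids) in delta_prime
            if pi[sigma] == sigma_prime}


def project_tree(tree, pi):
    """pi(T): relabel every node, leaving the domain unchanged."""
    label, children = tree
    return (pi[label], [project_tree(c, pi) for c in children])


def accepts(tree, delta, accepting):
    """Bottom-up recognition: is some run's root state in `accepting`?"""
    def states(t):
        label, children = t
        below = [states(c) for c in children]
        return {q for (s, q, kids) in delta
                if s == label and len(kids) == len(below)
                and all(k in b for k, b in zip(kids, below))}
    return bool(states(tree) & set(accepting))


# Hypothetical projection merging the labels a1 and a2 into a.
pi = {"a1": "a", "a2": "a", "f": "f"}
delta = {("a1", 0, ()), ("a2", 1, ()), ("f", 2, (0, 1))}
tree = ("f", [("a1", []), ("a2", [])])
delta_prime = {("a", 0, ()), ("f", 1, (0, 0))}
```

Neither operation inspects the states or the child structures, which is why the argument carries over unchanged to any dimension.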

5 wSnT3 

We are now in a position to build relational 
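structures on d-dimensional tree manifolds. Let
$T^d_n$ be the complete n-branching d-TM, that in
which every point has a child structure that has
depth $n$ in all its $(d-1)$ dimensions. Let

$$\mathsf{T}^3_n \stackrel{\text{def}}{=} \langle T^3_n, \triangleleft_1, \triangleleft_2, \triangleleft_3 \rangle$$

where, for all $x, y \in T^3_n$:

$$\begin{aligned}
x \triangleleft_3 y &\stackrel{\text{def}}{\iff} y = x \cdot \langle s \rangle \\
x \triangleleft_2 y &\stackrel{\text{def}}{\iff} x = p \cdot \langle s \rangle \text{ and } y = p \cdot \langle s \cdot \langle w \rangle \rangle \\
x \triangleleft_1 y &\stackrel{\text{def}}{\iff} x = p \cdot \langle s \cdot \langle w \rangle \rangle \text{ and } y = p \cdot \langle s \cdot \langle w \cdot v \rangle \rangle
\end{aligned}$$

where $p \in ((1^*)^*)^*$, $s \in (1^*)^*$, $w \in 1^*$, $v \in 1^+$
(which is to say that $x \triangleleft_i y$ iff $x$ is the immediate
predecessor of $y$ in the $i$-th dimension).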

The weak monadic second-order language of
$\mathsf{T}^3_n$ includes constants for each of the relations
(we let them stand for themselves), the usual
logical connectives, quantifiers and grouping
symbols, and two countably infinite sets of variables,
one ranging over individuals (for which
we employ lowercase) and one ranging over finite
subsets (for which we employ uppercase).
If $\varphi(x_1, \ldots, x_n, X_1, \ldots, X_m)$ is a formula of this
language with free variables among the $x_i$ and
$X_j$, then we will assert that it is satisfied in $\mathsf{T}^3_n$
by an assignment $s$ (mapping the $x_i$'s to individuals
and the $X_j$'s to finite subsets) with the
notation

$$\mathsf{T}^3_n \models \varphi\,[s].$$

A sentence is a formula with no free variables, that is,
a formula for which truth in $\mathsf{T}^3_n$ is not contingent
on an assignment. The set of all sentences
of this language that are satisfied by $\mathsf{T}^3_n$ is the
weak monadic second-order theory of $\mathsf{T}^3_n$,
denoted wSnT3. [6]

6 Definability in wSnT3 

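A set $\mathbb{T}$ of $\Sigma$-labeled 3-TM is definable in wSnT3
iff there is a formula $\varphi_{\mathbb{T}}(X_T, X_\sigma)_{\sigma \in \Sigma}$, with free
variables among $X_T$ (interpreted as the domain
of a tree) and $X_\sigma$ for each $\sigma \in \Sigma$ (interpreted
as the set of $\sigma$-labeled points in $T$), such that

$$\langle T, \tau \rangle \in \mathbb{T} \iff \mathsf{T}^3_n \models \varphi_{\mathbb{T}}\,[X_T \mapsto T,\ X_\sigma \mapsto \{\, p \mid \tau(p) = \sigma \,\}].$$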



It should be reasonably easy to see how any 
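recognizable set can be defined in this way. Suppose
the $i$-th tuple of 3-TM automaton $\mathcal{A}$ is
$\langle \sigma, 0, \mathcal{T} \rangle$, where $\mathcal{T}$ is the local tree whose root is
assigned state 1 and whose two children are assigned,
from left to right, states 0 and 1. A local (depth one in
its major dimension) 3-TM (labeled with both $\Sigma$ and
$Q$) is compatible with this iff its root satisfies

$$\begin{aligned}
\varphi_i(x) \equiv (\exists x_1, x_2, x_3)[\, & X_T(x_1) \wedge X_T(x_2) \wedge X_T(x_3) \wedge {} \\
& X_\sigma(x) \wedge X_0(x) \wedge X_1(x_1) \wedge X_0(x_2) \wedge X_1(x_3) \wedge {} \\
& (\forall y)[\, X_T(y) \to (\, (x \triangleleft_3 y \leftrightarrow (y \approx x_1 \vee y \approx x_2 \vee y \approx x_3)) \wedge {} \\
& \qquad (x_1 \triangleleft_2 y \leftrightarrow (y \approx x_2 \vee y \approx x_3)) \wedge {} \\
& \qquad \neg x_2 \triangleleft_2 y \wedge \neg x_3 \triangleleft_2 y \wedge {} \\
& \qquad (x_2 \triangleleft_1 y \leftrightarrow y \approx x_3) \wedge {} \\
& \qquad \neg x_3 \triangleleft_1 y \,) \,] \,]
\end{aligned}$$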

We can then require every node in Xt to be 
licensed by some tuple in A by requiring it to 
satisfy $\bigvee_i[\varphi_i(x)]$, the disjunction of such formulae
for all tuples in $\mathcal{A}$. All that remains is to
require the root to be labeled with an accepting
state and to "hide" the states by existentially
binding them:

$$(\exists X_q)_{q \in Q}(\forall x)[\, (X_T(x) \to \bigvee_i[\varphi_i(x)]) \wedge (\neg(\exists y)[y \triangleleft_3 x] \to \bigvee_{q \in Q_0}[X_q(x)]) \,].$$

It is not hard to show that a $\Sigma$-labeled 3-TM $\mathcal{T}$
corresponds to a satisfying assignment for this
formula iff there is a run of $\mathcal{A}$ on $\mathcal{T}$ which assigns
an accepting state to the root.

The proof that every set of trees definable in 
wSnT3 is recognizable, while a little more in- 
volved, is essentially a lift of the proofs of Doner 



6 wS1T1 is equivalent to wS1S in the sense of
interinterpretability, as is wS1Td for all d. wSnT2 is
interinterpretable with wSnS for all n > 2.



(1970) and Thatcher and Wright (1968). The
initial step is to show that every formula in the
language of wSnT3 can be reduced to equivalent
formulae in which only set variables occur and
which employ only the predicates $X \subseteq Y$ (with
the obvious interpretation) and $X \triangleleft_i Y$ (satisfied
iff $X$ and $Y$ are both singleton and the sole
element of $X$ stands in the appropriate relation
to the sole element of $Y$). We can define, for
instance,

$$\mathrm{Empty}(X) \equiv (\forall Y)[Y \subseteq X \to X \subseteq Y]$$

and

$$\mathrm{Singleton}(X) \equiv (\forall Y)[Y \subseteq X \to (\mathrm{Empty}(Y) \vee X \subseteq Y)].$$

Then $x \triangleleft_i y$ becomes

$$\mathrm{Singleton}(X) \wedge \mathrm{Singleton}(Y) \wedge X \triangleleft_i Y.$$

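It is easy to construct 3-TM automata (over
the alphabet $\mathcal{P}(\{X, Y\})$) which accept trees encoding
satisfying assignments for these atomic
formulae. For example, assignments satisfying
$X \triangleleft_3 Y$ in $\mathsf{T}^3_2$ are in $\mathcal{A}(\{2\})$ for $\mathcal{A}$:

$\langle \emptyset, 0, \mathcal{T} \rangle$, $\mathcal{T} \in \{\varepsilon, \ldots\}$,

$\langle \{Y\}, 1, \mathcal{T} \rangle$, $\mathcal{T} \in \{\varepsilon, \ldots\}$,

$\langle \{X\}, 2, \mathcal{T} \rangle$, $\mathcal{T} \in \{\ldots\}$,

$\langle \emptyset, 2, \mathcal{T} \rangle$, $\mathcal{T} \in \{\ldots\}$,

$\langle \sigma, 3, \mathcal{T} \rangle$, otherwise.

[The state-labeled local trees listed in each set were given as diagrams; only the empty structure $\varepsilon$ is recoverable here.]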

The extension to arbitrary formulae (over these 
atomic formulae) can then be carried out by in- 
duction on the structure of the formulae using 
the closure properties of the recognizable sets. 

7 Defining TALs in wSnT3 

The signature of wSnT3 is inconvenient for ex- 
pressing linguistic constraints. In particular, 
one of the strengths of the model-theoretic ap- 
proach is the ability to define long-distance re- 
lationships without having to explicitly encode 
them in the labels of the intervening nodes. 
We can extend the immediate predecessor re- 
lations to relations corresponding to (proper) 
above (within the 3-TM), domination (within a 
tree), and precedence (within a set of siblings) 
using: 

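$$x \triangleleft_i^+ y \equiv x \not\approx y \wedge (\exists X)[\, X(x) \wedge X(y) \wedge (\forall z)[\, X(z) \to (z \approx y \vee (\exists! z')[X(z') \wedge z \triangleleft_i z']) \,] \,].$$

This simply asserts that there is a sequence
of (at least two) points linearly ordered by $\triangleleft_i$ in
which $x$ precedes $y$. [7]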
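
On a small finite structure the second-order definition can be checked by brute force against the intended relation. The sketch below is illustrative only: it takes a six-node tree domain as the universe and the parent-child relation as the immediate predecessor, enumerates every subset $X$, and collects the pairs the formula relates; the names `closure_by_formula`, `child` and `nodes` are hypothetical.

```python
from itertools import combinations


def subsets(universe):
    """Every subset of a finite universe."""
    items = sorted(universe)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield set(combo)


def closure_by_formula(universe, step):
    """Pairs (x, y) with x distinct from y for which some X contains both and
    every z in X other than y has exactly one step-successor inside X."""
    pairs = set()
    for X in subsets(universe):
        for y in X:
            ok = all(z == y or sum(1 for z2 in X if step(z, z2)) == 1
                     for z in X)
            if ok:
                pairs.update((x, y) for x in X if x != y)
    return pairs


def child(a, b):
    """Immediate predecessor within a tree: b is a child of a."""
    return len(b) == len(a) + 1 and b[:-1] == a


nodes = {(), (0,), (1,), (0, 0), (0, 1), (1, 0)}
proper_domination = {(a, b) for a in nodes for b in nodes
                     if a != b and b[:len(a)] == a}
```

On this structure the formula picks out exactly proper domination: a witnessing $X$ is the path from $x$ to $y$, and the finiteness of $X$ rules out chains that never reach $y$.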

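To extend these through the entire structure
we have to address the fact that the two-dimensional
yield of a 3-TM is not well defined: there
is nothing that determines which leaf of the tree
expanding a node dominates the subtree rooted
at that node. To resolve this, we extend our
structures to include a set $H$ picking out exactly
one head in each set of siblings, with the "foot"
of a tree being that leaf reached from the root
by a path of all heads. Given $H$, it is possible to
define $\hat{\triangleleft}_2^+$ and $\hat{\triangleleft}_1^+$, variations of dominance and
precedence [8] that are inherited by substructures
in the appropriate way. Let:

$$\mathrm{Spine}_2(x) \stackrel{\text{def}}{\iff} H(x) \wedge (\forall y)[\, y \triangleleft_2^+ x \to (H(y) \vee \neg(\exists z)[z \triangleleft_2 y]) \,]$$

and

$$x \triangleleft_i^* y \stackrel{\text{def}}{\iff} x \triangleleft_i^+ y \vee x \approx y.$$

Then

$$x \mathrel{\hat{\triangleleft}_2^+} y \stackrel{\text{def}}{\iff} (\exists x', y')[\, x' \triangleleft_3^* x \wedge y' \triangleleft_3^* y \wedge x' \triangleleft_2^+ y' \wedge (\forall z)[(x' \triangleleft_3^+ z \wedge z \triangleleft_3^* x) \to \mathrm{Spine}_2(z)] \,]$$

and

$$x \mathrel{\hat{\triangleleft}_1^+} y \stackrel{\text{def}}{\iff} (\exists x', y')[\, x' \mathrel{\hat{\triangleleft}_2^*} x \wedge y' \mathrel{\hat{\triangleleft}_2^*} y \wedge x' \triangleleft_1^+ y' \,].$$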

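At the same time, it is convenient to include
the labels explicitly in the structures. A headed
$\Sigma$-labeled 3-TM, then, is a structure:

$$\langle T, \triangleleft_i, \triangleleft_i^+, \hat{\triangleleft}_i^+, H, P_\sigma \rangle_{1 \leq i \leq 3,\ \sigma \in \Sigma},$$

where $T$ is a rooted, connected subset of $T^3_n$ for
some $n$.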

With this signature it is easy to define the
set of 3-TM that captures a TAG in the sense
that their 2-dimensional yields (the set of maximal
points wrt $\triangleleft_3^+$, ordered by $\hat{\triangleleft}_2^+$ and $\hat{\triangleleft}_1^+$) form
the set of trees derived by the TAG. Note that
obligatory (OA) and null (NA) adjoining constraints
translate to a requirement that a node
be (non-)maximal wrt $\triangleleft_3^+$. In our
automata-theoretic interpretation of TAGs selective
adjoining (SA) constraints are encoded in the



states. Here we can express them directly: a 
constraint specifying the modifier trees which 
may adjoin to an N node, for instance, can be 
stated as a condition on the label of the root 
node of trees immediately below N nodes. 

In general, of course, SA constraints depend 
not only on the attributes (the label) of a node, 
but also on the elementary tree in which it oc- 
curs and its position in that tree. Both of these 
conditions are actually expressions of the local 
context of the node. Here, again, we can ex- 
press such conditions directly — in terms of the 
relevant elements of the node's neighborhood. 
At least in some cases this seems likely to allow 
for a more general expression of the constraints, 
abstracting away from the irrelevant details of 
the context. 

Finally, there are circumstances in which the 
primitive locality of SA constraints in TAGs 
is inconvenient. Schabes and Shieber (1994), 



7 This is partly a consequence of the fact that assign- 
ments to X are required to be finite. 
8 Of course $\hat{\triangleleft}_3^+$ is just $\triangleleft_3^+$.



for instance, suggest allowing multiple adjunc- 
tions of modifier trees to the same node on 
the grounds that selectional constraints hold be- 
tween the modified node and each of its modi- 
fiers but, if only a single adjunction may occur 
at the modified node, only the first tree that 
is adjoined will actually be local to that node. 
They point out that, while it is possible to pass 
these constraints through the tree by encoding 
them in the labels of the intervening nodes, such 
a solution can have wide ranging effects on the 
overall grammar. As we noted above, the ex- 
pression of such non-local constraints is one of 
the strengths of the model-theoretic approach. 
We can state them in a purely natural way: as
a simple restriction on the types of the modifier
trees which can occur below (in the $\triangleleft_3^+$ sense)
the modified node.

8 Conclusion 

We have obtained a descriptive characterization 
of the TALs via a generalization of existing char- 
acterizations of the CFLs and regular languages. 
These results extend the scope of the model- 
theoretic tools for obtaining language-theoretic 
complexity results for constraint- and principle- 
based theories of syntax to the TALs and, carry- 
ing the generalization to arbitrary dimensions, 
should extend it to cover a wide range of mildly 
context-sensitive language classes. Moreover, 
the generalization is natural enough that the 



results it provides should easily integrate with 
existing results employing the model-theoretic 
framework to illuminate relationships between 
theories. Finally, we believe that this character- 
ization provides an approach to defining TALs 
in a highly flexible and theoretically natural 
way. 